
    Representation Policy Iteration

    This paper addresses a fundamental issue central to approximation methods for solving large Markov decision processes (MDPs): how to automatically learn the underlying representation for value function approximation. A novel, theoretically rigorous framework is proposed that automatically generates geometrically customized orthonormal sets of basis functions, which can be used with any approximate MDP solver such as least-squares policy iteration (LSPI). The key innovation is a coordinate-free representation of value functions, using the theory of smooth functions on a Riemannian manifold. Hodge theory yields a constructive method for generating basis functions for approximating value functions, based on the eigenfunctions of the self-adjoint (Laplace-Beltrami) operator on manifolds. In effect, this approach performs a global Fourier analysis on the state-space graph to approximate value functions, where the basis functions reflect the large-scale topology of the underlying state space. A new class of algorithms called Representation Policy Iteration (RPI) is presented that automatically learns both basis functions and approximately optimal policies. Illustrative experiments compare the performance of RPI with that of LSPI using two hand-coded basis functions (RBF and polynomial state encodings). Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI 2005).
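
    The construction described here (eigenfunctions of the graph Laplacian over the state-space graph, used as basis functions for a solver such as LSPI) can be sketched in a few lines. The following is a minimal illustration under assumed simplifications, not the paper's RPI implementation: a hypothetical 10x10 grid world stands in for the sampled state space, and the number of basis functions k = 20 is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def laplacian_basis(adjacency, k):
    """Return the k smoothest eigenvectors of the normalized graph Laplacian.

    adjacency : (n_states, n_states) symmetric adjacency matrix of the state-space graph
    k         : number of basis functions to keep
    """
    L = laplacian(adjacency, normed=True)        # L = I - D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = eigh(L)                   # eigenvalues in ascending order
    # Low-order eigenvectors vary slowly over the graph and capture its
    # large-scale topology, acting as a Fourier-like basis on the state space.
    return eigvecs[:, :k]

# Hypothetical example: a 10x10 grid world with 4-connected states.
side = 10
n_states = side * side
W = np.zeros((n_states, n_states))
for i in range(side):
    for j in range(side):
        s = i * side + j
        if i + 1 < side:                         # neighbor below
            W[s, s + side] = W[s + side, s] = 1.0
        if j + 1 < side:                         # neighbor to the right
            W[s, s + 1] = W[s + 1, s] = 1.0

Phi = laplacian_basis(W, k=20)                   # feature matrix usable by LSPI / LSTD
print(Phi.shape)                                 # (100, 20)
```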

    Manifold Alignment using Procrustes Analysis

    In this paper we introduce a novel approach to manifold alignment, based on Procrustes analysis. Our approach differs from semi-supervised alignment in that it results in a mapping that is defined everywhere (when used with a suitable dimensionality reduction method) rather than just on the training data points. We describe and evaluate our approach both theoretically and experimentally, providing results showing useful knowledge transfer from one domain to another. Novel applications of our method, including cross-lingual information retrieval and transfer learning in Markov decision processes, are presented.
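
    The core Procrustes alignment step can be illustrated as follows. This is a minimal sketch, not the authors' code: it assumes both domains have already been mapped into a common low-dimensional space by some dimensionality reduction method and that corresponding training pairs are available; the toy data and function name are hypothetical.

```python
import numpy as np

def procrustes_align(X, Y):
    """Find scale k and orthogonal Q minimizing ||Xc - k * Yc Q||_F,
    where Xc, Yc are the mean-centered versions of the paired embeddings X, Y.

    X, Y : (n_pairs, d) low-dimensional embeddings of corresponding points
           from the two domains (output of some dimensionality reduction step).
    A new point y from the second domain is mapped into the first domain's
    space as k * (y - Y.mean(0)) @ Q + X.mean(0), so the mapping is defined
    everywhere, not only on the training pairs.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    M = Yc.T @ Xc                      # cross-covariance between the domains
    U, s, Vt = np.linalg.svd(M)
    Q = U @ Vt                         # optimal orthogonal alignment
    k = s.sum() / np.trace(Yc.T @ Yc)  # optimal isotropic scaling
    return k, Q

# Toy usage: the second domain is a transformed, scaled, noisy copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))            # random orthogonal transform
Y = 2.0 * X @ R + 0.01 * rng.normal(size=X.shape)
k, Q = procrustes_align(X, Y)
mapped = k * (Y - Y.mean(0)) @ Q + X.mean(0)
print(np.linalg.norm(X - mapped) / np.linalg.norm(X))   # small relative error
```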

    Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

    Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic and the hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplifying assumption: the logit matrix is symmetric. Let $n$ denote the length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$; we then aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top ) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, has $1-\delta$ success probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$, and $\omega$ denotes the exponent of matrix multiplication; currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column of the matrix $X$. Our main findings have the following implication for applied LLM tasks: for any super-large feature dimension, we can reduce it down to a size nearly linear in the length of the sentence.
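
    The flavor of the randomized result can be illustrated with a plain Gaussian random projection standing in for the paper's sketching construction; that substitution is an assumption for illustration and does not reproduce the $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ running-time guarantee. The sketch below only checks numerically that the entrywise error of the softmax-normalized attention matrix stays small when $d$ is compressed to $m = O(n \log n)$.

```python
import numpy as np

def softmax_attention(Z):
    """Row-normalized attention matrix D(Z)^{-1} exp(Z Z^T)."""
    A = np.exp(Z @ Z.T)
    return A / A.sum(axis=1, keepdims=True)

def sketch_features(X, m, rng):
    """Compress X in R^{n x d} to Y = X S in R^{n x m} with a Gaussian sketch S,
    so that Y Y^T approximates X X^T in expectation (JL-style)."""
    d = X.shape[1]
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
    return X @ S

# Illustrative sizes with d >> n; entries scaled so that ||X X^T||_inf < 0.1.
rng = np.random.default_rng(0)
n, d = 32, 4096
X = rng.normal(scale=0.05 / np.sqrt(d), size=(n, d))

m = int(n * np.log(n))          # target dimension m = O(n log n), far smaller than d
Y = sketch_features(X, m, rng)

err = np.abs(softmax_attention(Y) - softmax_attention(X)).max()
print(m, err)                   # entrywise error of the normalized attention matrix
```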

    An Over-parameterized Exponential Regression

    Over the past few years, there has been a significant amount of research focused on studying the ReLU activation function, with the aim of achieving neural network convergence through over-parametrization. However, recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of data points with labels $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t),x)$ can be expressed as $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. It is standard in the literature that the $a_r$ are fixed weights that are never changed during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussian distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize $a_r$ from the random sign distribution for each $r \in [m]$. Using the gradient descent algorithm, we can find weights $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
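
    The training setup in this abstract (exponential activation, Gaussian-initialized $w_r$, frozen random-sign $a_r$, gradient descent on the squared loss) can be sketched as follows. The width, learning rate, data normalization, and $1/\sqrt{m}$ output scaling are illustrative assumptions, not the paper's bounds or its convergence proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 1024       # data points, input dimension, width (illustrative)
lr, steps = 0.5, 2000

# Illustrative data: unit-norm inputs and bounded labels.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)

# Initialization as in the abstract: w_r ~ N(0, I_d), a_r from random signs (kept fixed).
W = rng.normal(size=(d, m))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # the 1/sqrt(m) scaling is assumed

def predict(W):
    """F(W, x) = sum_r a_r * exp(<w_r, x>), evaluated for every row x of X."""
    return np.exp(X @ W) @ a                        # shape (n,)

for _ in range(steps):
    residual = predict(W) - y                       # (n,)
    # d/dw_r of (1/2)||F - y||^2 = sum_i residual_i * a_r * exp(<w_r, x_i>) * x_i
    grad = X.T @ (np.exp(X @ W) * (residual[:, None] * a[None, :]))
    W -= (lr / n) * grad

print(np.linalg.norm(predict(W) - y))               # training residual should shrink
```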